Video super-resolution is one of the most popular tasks on mobile devices, being widely used for an automatic improvement of low-bitrate and low-resolution video streams. While numerous solutions have been proposed for this problem, they are usually quite computationally demanding, demonstrating low FPS rates and power efficiency on mobile devices. In this Mobile AI challenge, we address this problem and propose the participants to design an end-to-end real-time video super-resolution solution for mobile NPUs optimized for low energy consumption. The participants were provided with the REDS training dataset containing video sequences for a 4X video upscaling task. The runtime and power efficiency of all models was evaluated on the powerful MediaTek Dimensity 9000 platform with a dedicated AI processing unit capable of accelerating floating-point and quantized neural networks. All proposed solutions are fully compatible with the above NPU, demonstrating an up to 500 FPS rate and 0.2 [Watt / 30 FPS] power consumption. A detailed description of all models developed in the challenge is provided in this paper.
translated by 谷歌翻译
Twitter机器人检测已成为打击错误信息,促进社交媒体节制并保持在线话语的完整性的越来越重要的任务。最先进的机器人检测方法通常利用Twitter网络的图形结构,在面对传统方法无法检测到的新型Twitter机器人时,它们表现出令人鼓舞的性能。但是,现有的Twitter机器人检测数据集很少是基于图形的,即使这些基于图形的数据集也遭受有限的数据集量表,不完整的图形结构以及低注释质量。实际上,缺乏解决这些问题的大规模基于图的Twitter机器人检测基准,严重阻碍了基于图形的机器人检测方法的开发和评估。在本文中,我们提出了Twibot-22,这是一个综合基于图的Twitter机器人检测基准,它显示了迄今为止最大的数据集,在Twitter网络上提供了多元化的实体和关系,并且与现有数据集相比具有更好的注释质量。此外,我们重新实施35代表性的Twitter机器人检测基线,并在包括Twibot-22在内的9个数据集上进行评估,以促进对模型性能和对研究进度的整体了解的公平比较。为了促进进一步的研究,我们将所有实施的代码和数据集巩固到Twibot-22评估框架中,研究人员可以在其中始终如一地评估新的模型和数据集。 Twibot-22 Twitter机器人检测基准和评估框架可在https://twibot22.github.io/上公开获得。
translated by 谷歌翻译
大规模数据集上的视觉语言预训练(VLP)在各种下游任务上表现出了首要性能。对于VLP来说,完整且公平的基准(即包括大规模的预训练数据集和各种下游任务)是必不可少的。尽管有很多具有英语语料库的基准,但使用其他语言(例如中文)为VLP建立丰富的基准是一个关键问题。为此,我们为研究界建立了一个称为零的中国跨模式基准,以比较VLP模型。我们发布两个用于下游任务的预训练数据集和五个微调数据集。旁边,我们提出了一个新的预训练前训练框架,用于跨模式学习。具体而言,我们应用全局对比度预级分别学习图像和文本的各个表示。然后,我们通过图像文本交叉编码器和文本图像交叉编码器以细粒度的排名方式融合表示形式。为了进一步增强模型的能力,我们提出了一种由目标引导的蒸馏和特征引导的蒸馏组成的双向蒸馏策略。对于简洁起见,我们将型号r2d2命名。我们在四个公共跨模式数据集和拟议的五个下游数据集上实现最先进的性能。在Flickr30k-CN,可可-CN和Muge进行零射击任务时,与最平均召回的R2D2进行了2.5亿个数据集的R2D2,在2.5亿个数据集中进行了4.7%,5.4%和6.3%的均值改善,而与最新的召回相比艺术。数据集,模型和代码可在https://github.com/yuxie11/r2d2上找到
translated by 谷歌翻译
无监督域适应(UDA)技术的最新进展在跨域计算机视觉任务中有巨大的成功,通过弥合域分布差距来增强数据驱动的深度学习架构的泛化能力。对于基于UDA的跨域对象检测方法,其中大多数通过对抗性学习策略引导域不变特征产生来缓解域偏差。然而,由于不稳定的对抗性培训过程,他们的域名鉴别器具有有限的分类能力。因此,它们引起的提取特征不能完全域不变,仍然包含域私有因素,使障碍物进一步缓解跨域差异。为了解决这个问题,我们设计一个域分离rcnn(DDF),以消除特定于检测任务学习的特定信息。我们的DDF方法促进了全局和本地阶段的功能解剖,分别具有全局三联脱离(GTD)模块和实例相似性解剖(ISD)模块。通过在四个基准UDA对象检测任务上表现出最先进的方法,对我们的DDF方法进行了宽阔的适用性。
translated by 谷歌翻译
对于不同的任务,已经越来越多地研究了一般点云,并且提出了最近的基于变换器的网络,用于点云分析。然而,医疗点云几乎没有相关的作品,这对疾病检测和治疗很重要。在这项工作中,我们提出了专门用于医疗点云的关注模型,即3D医疗点变压器(3Dmedpt),以检查复杂的生物结构。通过增强上下文信息并在查询时总结本地响应,我们的注意模块可以捕获本地上下文和全局内容功能交互。然而,医疗数据的培训样本不足可能导致特征学习差,因此我们应用位置嵌入,以学习准确的局部几何和多图形推理(MGR)来检查通过通道图的全局知识传播,以丰富特征表示。在数据集内进行的实验证明了3DMedpt的优越性,在那里我们达到了最佳分类和分割结果。此外,我们的方法的有希望的泛化能力在一般的3D点云基准测试中验证:ModelNet40和ShapenetPart。代码即将发布。
translated by 谷歌翻译
我们提出了一种新颖的形状意识的关系网络,用于内窥镜粘膜颌下粘膜释放(ESD)手术中的准确和实时地标检测。这项任务具有很大的临床意义,但由于复杂的手术环境中出血,照明反射和运动模糊而极其挑战。与现有解决方案相比,通过使用复杂的聚合方案忽略靶向对象之间的几何关系或捕获关系,所提出的网络能够实现令人满意的精度,同时通过充分利用地标之间的空间关系来保持实时性能。我们首先设计一种算法来自动生成关系关键点热量表,其能够直观地代表地标之间的空间关系的先验知识,而无需使用任何额外的手动注释工作。然后,我们开发两个互补正规计划,以逐步将先验知识纳入培训过程。虽然一个方案通过多任务学习引入像素级正则化,但另一个方案通过利用新设计的分组的一致性评估器来实现全局级正则化,该评估将关系约束以越野方式添加到所提出的网络。这两个方案都有利于训练模型,并且可以随时推动才能卸载,以实现实时检测。我们建立了一个大型内部数据集的ESD手术,用于食管癌,以验证我们提出的方法的有效性。广泛的实验结果表明,我们的方法在准确性和效率方面优于最先进的方法,更快地实现了更好的检测结果。在两个下游应用的有希望的结果进一步证实了我们在ESD临床实践中的方法的巨大潜力。
translated by 谷歌翻译
Domain adaptive detection aims to improve the generalization of detectors on target domain. To reduce discrepancy in feature distributions between two domains, recent approaches achieve domain adaption through feature alignment in different granularities via adversarial learning. However, they neglect the relationship between multiple granularities and different features in alignment, degrading detection. Addressing this, we introduce a unified multi-granularity alignment (MGA)-based detection framework for domain-invariant feature learning. The key is to encode the dependencies across different granularities including pixel-, instance-, and category-levels simultaneously to align two domains. Specifically, based on pixel-level features, we first develop an omni-scale gated fusion (OSGF) module to aggregate discriminative representations of instances with scale-aware convolutions, leading to robust multi-scale detection. Besides, we introduce multi-granularity discriminators to identify where, either source or target domains, different granularities of samples come from. Note that, MGA not only leverages instance discriminability in different categories but also exploits category consistency between two domains for detection. Furthermore, we present an adaptive exponential moving average (AEMA) strategy that explores model assessments for model update to improve pseudo labels and alleviate local misalignment problem, boosting detection robustness. Extensive experiments on multiple domain adaption scenarios validate the superiority of MGA over other approaches on FCOS and Faster R-CNN detectors. Code will be released at https://github.com/tiankongzhang/MGA.
translated by 谷歌翻译
Recent investigations on rotation invariance for 3D point clouds have been devoted to devising rotation-invariant feature descriptors or learning canonical spaces where objects are semantically aligned. Examinations of learning frameworks for invariance have seldom been looked into. In this work, we review rotation invariance in terms of point cloud registration and propose an effective framework for rotation invariance learning via three sequential stages, namely rotation-invariant shape encoding, aligned feature integration, and deep feature registration. We first encode shape descriptors constructed with respect to reference frames defined over different scales, e.g., local patches and global topology, to generate rotation-invariant latent shape codes. Within the integration stage, we propose Aligned Integration Transformer to produce a discriminative feature representation by integrating point-wise self- and cross-relations established within the shape codes. Meanwhile, we adopt rigid transformations between reference frames to align the shape codes for feature consistency across different scales. Finally, the deep integrated feature is registered to both rotation-invariant shape codes to maximize feature similarities, such that rotation invariance of the integrated feature is preserved and shared semantic information is implicitly extracted from shape codes. Experimental results on 3D shape classification, part segmentation, and retrieval tasks prove the feasibility of our work. Our project page is released at: https://rotation3d.github.io/.
translated by 谷歌翻译
With the attention mechanism, transformers achieve significant empirical successes. Despite the intuitive understanding that transformers perform relational inference over long sequences to produce desirable representations, we lack a rigorous theory on how the attention mechanism achieves it. In particular, several intriguing questions remain open: (a) What makes a desirable representation? (b) How does the attention mechanism infer the desirable representation within the forward pass? (c) How does a pretraining procedure learn to infer the desirable representation through the backward pass? We observe that, as is the case in BERT and ViT, input tokens are often exchangeable since they already include positional encodings. The notion of exchangeability induces a latent variable model that is invariant to input sizes, which enables our theoretical analysis. - To answer (a) on representation, we establish the existence of a sufficient and minimal representation of input tokens. In particular, such a representation instantiates the posterior distribution of the latent variable given input tokens, which plays a central role in predicting output labels and solving downstream tasks. - To answer (b) on inference, we prove that attention with the desired parameter infers the latent posterior up to an approximation error, which is decreasing in input sizes. In detail, we quantify how attention approximates the conditional mean of the value given the key, which characterizes how it performs relational inference over long sequences. - To answer (c) on learning, we prove that both supervised and self-supervised objectives allow empirical risk minimization to learn the desired parameter up to a generalization error, which is independent of input sizes. Particularly, in the self-supervised setting, we identify a condition number that is pivotal to solving downstream tasks.
translated by 谷歌翻译
In the new era of personalization, learning the heterogeneous treatment effect (HTE) becomes an inevitable trend with numerous applications. Yet, most existing HTE estimation methods focus on independently and identically distributed observations and cannot handle the non-stationarity and temporal dependency in the common panel data setting. The treatment evaluators developed for panel data, on the other hand, typically ignore the individualized information. To fill the gap, in this paper, we initialize the study of HTE estimation in panel data. Under different assumptions for HTE identifiability, we propose the corresponding heterogeneous one-side and two-side synthetic learner, namely H1SL and H2SL, by leveraging the state-of-the-art HTE estimator for non-panel data and generalizing the synthetic control method that allows flexible data generating process. We establish the convergence rates of the proposed estimators. The superior performance of the proposed methods over existing ones is demonstrated by extensive numerical studies.
translated by 谷歌翻译